
Support: surface hung tests on pytest session timeout#721

Merged
ChaoWao merged 1 commit into hw-native-sys:main from
ChaoWao:support/surface-hung-tests-on-pytest-session-timeout
May 8, 2026

Conversation


ChaoWao (Collaborator) commented May 8, 2026

Summary

When the parent pytest dispatcher's session watchdog fired
(--pto-session-timeout), the CI log showed only the timeout banner.
The actually-stuck child subprocess was killed silently, its captured
stdout discarded, and the child left to be reaped as an orphan by the
runner cleanup — see run 25541961688 for the symptom (9.5 min of dead
air, then [pytest] TIMEOUT, with no indication of which case was stuck).

This PR makes the timeout path self-diagnosing.

Changes

  • parallel_scheduler.py — emit [scheduler] START <label> pid=... devices=...
    at launch (not only at completion). The last START without a matching
    PASS/FAIL identifies the hung case at a glance.
  • parallel_scheduler.py — expose _active_state while run_jobs
    is in flight so the parent's signal handler can reach the live job
    table.
  • conftest.py — on session-timeout SIGALRM:
    1. send SIGUSR1 to every in-flight child (faulthandler, registered
      in pytest_configure, dumps all-thread Python + C stacks into the
      child's stdout — works even when the GIL is held by a native NPU
      call, which Python-level watchdogs cannot reach);
    2. let pumps drain ~2 s so the dumped stacks reach output_lines;
    3. print each in-flight job's tail buffer inside a ::group::HUNG ...
      block;
    4. call _terminate_all so children don't outlive us as orphans.

What the log will look like next time

Normal: every case has a [scheduler] START line followed (possibly out
of order, since cases run in parallel) by a ::group:: ... [PASS|FAIL ...]
block.

On hang: the START for the stuck case has no matching PASS/FAIL, then
[pytest] TIMEOUT: ... appears, followed by a ::group::HUNG <label> pid=...
elapsed=... block containing the all-thread traceback the child wrote when
it caught SIGUSR1. The existing CI retry (any non-zero rc → re-run with the
pinned PTO-ISA commit) is unchanged: a PTO-ISA regression that surfaces as
a hang is exactly the case that retry path is meant to recover from.
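The SIGUSR1 + faulthandler handoff can be demonstrated standalone. This script is illustrative (it is not code from the PR): a parent signals a "hung" child that registered faulthandler at startup, then recovers the child's stack dump over its stdout pipe:

```python
import signal
import subprocess
import sys
import time

# The child pretends to hang after registering faulthandler on SIGUSR1 —
# the same setup the PR performs in the child's pytest_configure.
CHILD = (
    "import faulthandler, signal, sys, time\n"
    "faulthandler.register(signal.SIGUSR1, file=sys.stdout, all_threads=True)\n"
    "print('child ready', flush=True)\n"
    "time.sleep(60)\n"  # stand-in for a stuck native call
)

proc = subprocess.Popen([sys.executable, "-c", CHILD],
                        stdout=subprocess.PIPE, text=True)
assert proc.stdout.readline().strip() == "child ready"
time.sleep(0.5)                    # let the child settle into its sleep
proc.send_signal(signal.SIGUSR1)   # parent side: trigger the stack dump
time.sleep(0.5)                    # let the dump land in the pipe
proc.terminate()                   # the _terminate_all step: no orphan left
out, _ = proc.communicate()
print(out)  # an all-thread traceback of the hung child
```

Because faulthandler's signal handler is implemented in C, the dump is written even while the child's Python threads are blocked, which is what lets this reach hangs a Python-level watchdog thread cannot.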

Notes

  • pytest-timeout per-test guard is intentionally not enabled; the
    single-layer "parent catches everything via SIGUSR1" path covers all
    hang shapes (including native NPU deadlocks where pytest-timeout's
    watchdog thread can't fire). If hangs become frequent enough that
    burning the full session budget per incident is painful, adding
    timeout = N / timeout_method = "thread" to pyproject.toml is a
    pure increment on top of this PR.
  • macOS has SIGUSR1; Windows does not. All new SIGUSR1 paths are
    guarded by hasattr(signal, "SIGUSR1").
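The child-side registration with that guard is small; pytest_configure is a real pytest hook, and the rest mirrors what the PR describes (placing it in the child's conftest.py and routing the dump to stdout are this sketch's assumptions):

```python
import faulthandler
import signal
import sys

def pytest_configure(config):
    # Guard: Windows has no SIGUSR1, so the hook is a no-op there.
    if hasattr(signal, "SIGUSR1"):
        # file=sys.stdout routes the dump into the captured stdout stream
        # the parent pumps; all_threads=True includes threads parked in
        # native calls that still hold the GIL.
        faulthandler.register(signal.SIGUSR1, file=sys.stdout,
                              all_threads=True, chain=True)
```

chain=True keeps any previously installed SIGUSR1 handler working, so the registration stays composable with other tooling.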

Testing

  • Local: trigger a hang test (time.sleep(9999)) under
    --pto-session-timeout 30; confirm the CI log shows
    ::group::HUNG ... with traceback, rc=124, no orphan python
    processes after exit.
  • CI: PR validation against simulation runners (no actual hang,
    should look identical to before plus the new START lines).
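The local repro in the first bullet can be as small as the following (the test name is illustrative; --pto-session-timeout is this repo's own option, not a stock pytest flag):

```python
import time

def test_deliberate_hang():
    # Sleeps far past the 30 s session budget, exercising the timeout path.
    time.sleep(9999)
```

Run it with something like `pytest --pto-session-timeout 30` and check for rc=124, a ::group::HUNG block with a traceback, and no leftover python processes afterwards.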

ChaoWao force-pushed the support/surface-hung-tests-on-pytest-session-timeout branch from 9525e7e to fccdced on May 8, 2026 09:02

gemini-code-assist (Bot) left a comment

Code Review

This pull request enhances the test session timeout handling by implementing a mechanism to debug hung child processes. It introduces a module-global state in the parallel scheduler to track active jobs, allowing the timeout handler to send SIGUSR1 signals to stuck processes for stack trace generation via faulthandler. The changes also include improved logging for job starts and GitHub Actions-style grouping for hung process output. I have no feedback to provide as there were no review comments.

ChaoWao merged commit 73b3fff into hw-native-sys:main on May 8, 2026
14 checks passed
ChaoWao deleted the support/surface-hung-tests-on-pytest-session-timeout branch on May 8, 2026 09:11
